Unstable Cut-Points Based Sample Selection for Large Data Classification
WANG Xizhao1, XING Sheng2,3, ZHAO Shixin2,4
1. College of Mathematics and Information Science, Hebei University, Baoding 071002; 2. School of Management, Hebei University, Baoding 071002; 3. College of Computer Science and Engineering, Cangzhou Normal University, Cangzhou 061001; 4. Department of Mathematics and Physics, Shijiazhuang Tiedao University, Shijiazhuang 050043
Abstract: Traditional sample selection methods incur high computational complexity and long running times when used to compress large data sets. To address this problem, a sample selection method based on unstable cut-points is proposed for compressing large data sets. Since a convex function attains its extreme values at the endpoints of an interval, the endpoint degree of a sample can be measured by computing the unstable cut-points of all attributes according to this basic property. Samples with a higher endpoint degree are then selected, so the calculation of distances between samples is avoided and computational efficiency is improved without degrading classification accuracy. Experimental results show that the proposed algorithm compresses large data sets with high imbalance ratios effectively and exhibits strong noise tolerance.
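To make the idea in the abstract concrete, the following is a minimal sketch of endpoint-degree scoring; it is an illustration under stated assumptions, not the paper's actual algorithm. Here a cut-point between two samples that are adjacent in the sorted order of one attribute is treated as unstable when their class labels differ, and a sample's endpoint degree is taken as the number of unstable cut-points it borders across all attributes. The names `endpoint_degrees` and `select_samples` and the retention parameter `ratio` are hypothetical, and details such as tie handling among equal attribute values are simplified.

```python
import numpy as np

def endpoint_degrees(X, y):
    """Count, for each sample, how many unstable cut-points it borders.

    A cut-point between two samples that are adjacent after sorting on
    one attribute is 'unstable' when their class labels differ; samples
    bordering many such cuts tend to lie near class boundaries.
    """
    n, d = X.shape
    degree = np.zeros(n, dtype=int)
    for j in range(d):
        order = np.argsort(X[:, j], kind="stable")
        labels = y[order]
        # adjacent pairs with different labels straddle an unstable cut
        unstable = labels[:-1] != labels[1:]
        degree[order[:-1]] += unstable  # left neighbor of each cut
        degree[order[1:]] += unstable   # right neighbor of each cut
    return degree

def select_samples(X, y, ratio=0.2):
    """Keep the fraction `ratio` of samples with the highest endpoint degree."""
    deg = endpoint_degrees(X, y)
    k = max(1, int(ratio * len(y)))
    keep = np.argsort(deg)[::-1][:k]
    return X[keep], y[keep]

# toy usage: 1000 samples, 5 attributes, binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_sel, y_sel = select_samples(X, y, ratio=0.1)
```

In this sketch the scoring costs O(d·n log n) for n samples and d attributes, dominated by the per-attribute sorts, which is consistent with the abstract's claim that pairwise distance computations between samples are avoided.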